



The  Midway  Distributed  Shared  Memory  System 


Abstract 

This paper describes the motivation, design and
performance of Midway, a programming system for a
distributed shared memory multicomputer (DSM) such as
an ATM-based cluster, a CM-5, or a Paragon. Midway
supports a new memory consistency model called
entry consistency. Entry consistency guarantees that
shared data becomes consistent at a processor when the
processor acquires a synchronization object known to
guard the data. Entry consistency is weaker than other
models described in the literature, such as processor
consistency and release consistency, but it makes
possible higher performance implementations of the
underlying consistency protocols. Midway programs are
written in C, and the association between synchronization
objects and data must be made with explicit
annotations. As a result, pure entry consistent programs
can require more annotations than programs written to
other models. In addition to entry consistency, Midway
also supports the stronger release consistent and
processor consistent models at the granularity of
individual data items. Consequently, the programmer can
trade off potentially reduced performance for the
additional programming complexity required to write an
entry consistent parallel program.

1  Introduction 

Midway  is  a  distributed  shared  memory  (DSM)  pro¬ 
gramming  system  supporting  multiple  memory  consis¬ 
tency  models  within  a  single  parallel  program.  Mid¬ 
way  is  intended  for  use  on  medium-scale  multicom¬ 
puters  (fewer  than  100  nodes),  such  as  an  ATM-based 
cluster [Rider 89], a TMC CM-5, or an Intel Paragon.
In  addition  to  supporting  processor  consistency  and 
release  consistency,  Midway  supports  a  new  memory 
consistency  model  called  entry  consistency.  Entry 
consistency  guarantees  that  shared  data  becomes  con¬ 
sistent  at  a  processor  only  when  the  processor  acquires 
a  synchronization  object  that  guards  the  data.  Fur¬ 
thermore,  the  only  data  that  is  guaranteed  to  be  con¬ 
sistent  is  that  guarded  by  the  acquired  synchronization 
object.  This  allows  an  implementation  of  entry  con¬ 
sistency  to  reduce  the  frequency  of  global  communica¬ 
tion  by  exploiting  synchronization  patterns  between 
processors.  Midway’s  implementation  of  entry  con¬ 
sistency  requires  that  the  relationship  between  data 
and  synchronization  objects  (which  is  implicit  in  the 
structure  of  a  parallel  program)  be  made  explicit  to 
the  compiler  and  the  runtime  system. 

Midway  supports  multiple  consistency  models 
within  a  single  program  to  ease  the  task  of  construct¬ 
ing  a  program  that  runs  efficiently  on  a  DSM  system. 
A  program  running  under  Midway  may  contain  data 
that is processor consistent, release consistent, or entry
consistent. Furthermore, within a single run of a
program,  multiple  consistency  models  may  be  active 
at  the  same  time.  This  allows  the  programmer  to  be¬ 
gin  with  a  processor  consistent  parallel  program,  and 
then  selectively  relax  its  consistency  requirements  for 
shared  data  by  modifying  the  program  to  use  one  of 
the  weaker  models. 

1.1  Motivation 

A  wide  range  of  memory  consistency  models  ex¬ 
ists,  and  each  offers  a  different  guarantee  about  the 
strength  and  timeliness  with  which  updates  to  shared 
memory take effect at processors distributed throughout
a network. In decreasing order of strength, these models
include sequential consistency [Lamport 79], processor
consistency [Goodman & Woest 88], weak consistency
[Dubois et al. 86], release consistency [Gharachorloo
et al. 90] and entry consistency, which is described
in this paper. Each successive model can increase
a processor’s tolerance for latency in the memory
system by relaxing the rules that determine the
behavior of operations which write to shared memory.
Aggressive implementations of the weaker models
are capable of delivering higher performance than
those of stronger ones because they better tolerate
network delays and limited bandwidth [Gharachorloo
et al. 91, Zucker & Baer 92].

Programmers often assume that memory is sequentially
consistent. This means that the “result of any
execution is the same as if the operations of all the
processors were executed in some sequential order, and
the operations of each processor appear in this
sequence in the order specified by its program” [Lamport
79]. In a sequentially consistent system, one processor’s
update to a shared data value is reflected in every
other processor’s memory before the updating processor
is able to issue another memory access. Unfortunately,
sequentially consistent memory systems preclude
many optimizations such as reordering, batching,
or coalescing. These optimizations reduce the
performance impact of having distributed memories with
non-uniform access times [Dubois et al. 86].

Memory  consistency  requirements  can  be  relaxed 
by  taking  advantage  of  the  fact  that  most  parallel 
programs  already  define  their  own  higher-level  con¬ 
sistency  requirements.  This  is  done  by  means  of  ex¬ 
plicit  synchronization  operations  such  as  lock  acqui¬ 
sition  and  barrier  entry.  These  operations  impose  an 
ordering  on  access  to  data  within  the  program.  In 
the  absence  of  such  operations,  a  multithreaded  pro¬ 
gram  is  in  effect  relinquishing  all  control  over  the  order 
and atomicity of memory operations to the underlying
memory system.

These observations about explicit synchronization
have led to a class of weakly consistent protocols
[Dubois et al. 86, Scheurich & Dubois 87, Adve
& Hill 89, Gharachorloo et al. 90]. Such protocols
distinguish between normal shared accesses and
synchronization accesses. The only accesses that must
execute in a sequentially consistent order are those
relating to synchronization.¹

A  weaker  model  offers  fewer  guarantees  about  mem¬ 
ory  consistency,  but  it  ensures  that  a  “well-behaved” 
program  executes  as  though  it  were  running  on  a  se¬ 
quentially  consistent  memory  system.  The  definition 
of “well-behaved” varies according to the model. For
example,  in  a  processor  consistent  system,  the  pro¬ 
grammer  may  not  assume  that  all  memory  operations 
are  performed  in  the  same  order  at  all  processors.  (A 
load  or  store  is  globally  performed  when  it  is  per¬ 
formed  with  respect  to  all  processors.  A  load  is  per¬ 
formed  with  respect  to  a  processor  when  no  write  by 
that  processor  can  change  the  value  returned  by  the 
load.  A  store  is  performed  with  respect  to  a  processor 
when  a  load  by  that  processor  will  return  the  value  of 
the  store.)  For  weak  consistency,  the  programmer  may 
not  assume  that  a  processor’s  updates  are  performed 
at  other  processors  until  the  updating  processor  issues 
a  synchronization  operation.  For  release  consistency, 
only  a  processor’s  releasing  synchronization  operation 
guarantees  that  its  previous  updates  will  be  performed 
at  other  processors,  and  only  a  processor’s  acquiring 
synchronization  operation  guarantees  that  other  pro¬ 
cessors’  updates  have  been  performed  at  it.  (A  releas¬ 
ing  synchronization  operation  signals  to  other  proces¬ 
sors  that  shared  data  is  available,  while  an  acquiring 
operation  signals  that  shared  data  is  needed.)  For  en¬ 
try  consistency,  data  is  only  consistent  on  an  acquiring 
synchronization  operation,  and  only  the  data  known 
to  be  guarded  by  the  acquired  object  is  guaranteed  to 
be  consistent. 

Programs with good behavior do not assume a
stronger consistency guarantee from the memory system
than is actually provided. Each model’s definition
of good behavior places demands on the programmer
to ensure that a program’s access to shared data
conforms to that model’s consistency rules. For example,
with entry consistency, a processor must not access a
shared item until it has performed a synchronization
operation on the item’s associated synchronization
object. These rules provide the memory system with
information to allow a well-behaved program to execute
as though it were running on a sequentially consistent
memory system. Unfortunately, the rules can add an
additional dimension of complexity to the already
difficult task of writing new parallel programs and
porting old ones. The additional programming complexity
can result in higher performance, though, because it
provides greater control over communication costs.

¹ In practice, synchronization accesses need only be processor
consistent [Goodman & Woest 88]; that is, writes issued from
a single processor must be performed in the order issued at
all processors, but writes from different processors need not be
observed in the same order everywhere. The distinction between
sequentially consistent and processor consistent synchronization
is small; however, it is easier to build a processor consistent
system.

1.2  Multiple  models 

Midway  allows  a  programmer  to  navigate  through 
a  subset  of  the  consistency  models,  selecting  one,  or 
several,  to  achieve  an  acceptable  tradeoff  between  per¬ 
formance  and  programmability.  A  program  written  for 
Midway  can  use  entry  consistency,  release  consistency, 
or  processor  consistency.  For  all  of  these  models,  lo¬ 
cal  memories  on  each  processor  cache  recently  used 
data  and  synchronization  objects.  With  entry  consis¬ 
tency,  communication  between  processors  occurs  only 
when  a  processor  acquires  a  synchronization  object. 
Only  the  data  guarded  by  the  synchronization  object 
is  guaranteed  to  become  consistent  at  the  time  of  the 
acquire.  Consequently,  Midway  provides  an  execution 
environment  where  a  parallel  program’s  performance 
is  ultimately  limited  only  by  its  internal  synchroniza¬ 
tion  patterns. 

Although  entry  consistency  enables  the  use  of  low- 
overhead  consistency  mechanisms,  writing  an  entry 
consistent  program  requires  more  work  than  writing 
one  to  a  stronger  model.  For  example,  every  synchro¬ 
nization  object  must  be  identified;  every  use  of  such  an 
object  must  be  explicit;  every  shared  data  item  must 
be  associated  with  a  synchronization  object;  and  syn¬ 
chronization  accesses  should  be  qualified  as  read-only 
or  read-write  for  best  performance. 

To  make  these  restrictions  less  onerous,  Midway 
provides  a  graceful  migration  path  away  from  more 
strongly  consistent  models  to  entry  consistency.  A 
programmer  begins  with  a  processor  consistent  par¬ 
allel  program  or  algorithm  and  sets  Midway’s  consis¬ 
tency  model  “dial”  to  processor  consistency.  The  run¬ 
time  system  can  be  used  to  collect  reference  patterns 
for  shared  data  so  that  strongly  consistent  code  which 
accesses  heavily  shared  data  can  be  reorganized  to  use 
a  weaker  consistency  model.  With  this,  a  programmer 
can  quickly  get  an  application  running  on  the  DSM, 
although  the  application  may  not  run  very  quickly. 

Midway implements its consistency protocols in
software and has no dependencies on any specific
hardware characteristic other than the ability to send
messages between processors. A strictly software solution
is attractive because it allows us to exploit
application-specific information at the lowest levels of the
system. It also ensures portability across a wide range of
multicomputer architectures. The system described in this
paper is operational on a cluster of MIPS R3000-based
DECstations running CMU’s Mach 3.0 operating system
[Accetta et al. 86] over both Ethernet and an
ATM network.

1.3  Related  work 

Memory  consistency  models  for  DSM  systems  have 
been  implemented  in  both  hardware  and  software. 
Earlier  hardware-based  systems  used  snooping  pro¬ 
tocols  where  each  processor  monitored  a  shared  bus 
to  implement  processor  consistency.  The  Stanford 
DASH  [Lenoski  et  al.  92]  multiprocessor  supports  re¬ 
lease  consistency  in  hardware  using  a  directory-based 
protocol  over  a  dedicated  low-latency  interconnect. 

Most  software  systems  intended  for  parallel  pro¬ 
gramming  have  implemented  these  same  consistency 
models  using  conventional  virtual  memory  manage¬ 
ment  hardware  and  local  area  networks.  Li’s  Ivy  sys¬ 
tem  [Li  86]  described  the  first  implementation  of  such 
a  page-based  DSM  and  was  followed  by  several  other 
systems [Fleisch 87, Forin et al. 89]. Munin [Carter
et  al.  91]  is  a  software  system  which  uses  release 
consistency  to  support  automatic  data  caching  over 
a  local  area  network.  Munin  is  unique  among  weak 
consistency  systems,  in  that  it  implements  multiple 
consistency  protocols  which  can  be  used  on  a  type- 
specific  basis.  Munin  uses  hints  from  the  programmer 
to  determine  the  access  patterns  to  shared  data  items, 
and  then  selects  the  best  consistency  protocol  for  each. 
Munin  differs  from  Midway  in  that  it  offers  multiple 
implementations  of  a  single  consistency  model  (release 
consistency),  whereas  Midway  supports  multiple  con¬ 
sistency  models  within  a  single  program. 

Lazy release consistency [Keleher et al. 92]
is  a  technique  for  implementing  release  consistency 
through  causal  broadcast.  It  has  been  shown  through 
simulation  to  greatly  reduce  the  number  of  messages 
required  by  systems  such  as  Munin.  Our  work  with  en¬ 
try  consistency  can  be  considered  as  an  extreme  vari¬ 
ant  of  lazy  release  consistency  in  that  Midway’s  ex¬ 
plicit  association  between  synchronization  objects  and 
data  offers  the  runtime  system  additional  information 
about  causality. 


1.4  The  rest  of  this  paper 

In  Section  2  we  describe  entry  consistency.  In  Sec¬ 
tion  3  we  describe  Midway’s  programming  interface  in 
the  context  of  multiple  consistency  models.  In  Sec¬ 
tion  4  we  describe  the  important  aspects  of  Midway’s 
implementation  and  show  that  the  infrastructure  re¬ 
quired  by  entry  consistency  can  be  adapted  to  provide 
each  of  the  stronger  consistency  models.  In  Section  5 
we  discuss  performance.  In  Section  6  we  present  our 
conclusions. 

2  Entry  consistency 

Entry  consistency  takes  advantage  of  the  relation¬ 
ship  between  specific  synchronization  objects  that  pro¬ 
tect  critical  sections  and  the  shared  data  accessed 
within  those  critical  sections.  A  critical  section  is  a 
region  of  code  that  accesses  data  which  may  have  been 
written  by  another  processor.  A  synchronization  ob¬ 
ject  controls  a  processor’s  access  to  the  code  and  data 
in  the  critical  section.  Examples  of  critical  sections 
are  code  sequences  guarded  by  a  mutex,  or  phased  by 
a  barrier.  In  an  entry  consistent  system,  a  proces¬ 
sor’s  view  of  shared  memory  becomes  consistent  with 
the  most  recent  updates  only  when  it  enters  a  critical 
section. 

The  entry  consistent  model  matches  that  already 
used  by  many  shared  memory  parallel  programs, 
namely,  the  use  of  critical  sections  to  guard  access 
to shared data for which the result of an unguarded
access is undefined.

2.1  Performing  store  operations 

A  consistency  model  does  not  define  whether  store 
operations are performed at a processor using an
invalidation-based  or  an  update-based  protocol.  With 
an  invalidation-based  protocol,  an  operation  is  per¬ 
formed  at  a  remote  processor  by  invalidating  an  entry 
in  that  processor’s  local  cache.  The  processor’s  next 
access  to  the  invalidated  entry  results  in  a  cache  miss 
and  a  round-trip  network  message  to  fetch  the  missed 
value.  With  an  update-based  protocol,  an  operation 
is  performed  at  a  remote  processor  when  the  stored 
value  is  deposited  in  that  processor’s  cache.  This  al¬ 
lows  the  next  access  to  the  item  to  always  be  satisfied 
locally.  The  advantage  of  an  invalidation-based  proto¬ 
col  is  that  consistency  messages  can  be  smaller  because 
they  contain  only  addresses,  not  data.  The  advantage 
of  the  update-based  protocol  is  that  it  greatly  reduces 
the  likelihood  of  a  cache  miss. 
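
The difference between the two protocols can be illustrated with a small model of one remote cache entry. This is only a sketch: the type `cache_entry` and the functions `perform_by_invalidate`, `perform_by_update` and `remote_read` are hypothetical names chosen for illustration and are not part of Midway.

```c
#include <assert.h>
#include <stdbool.h>

/* One cached copy of a shared item at a remote processor (illustrative). */
typedef struct {
    int  value;
    bool valid;
} cache_entry;

/* Invalidation-based: the remote copy is marked stale; no data is sent. */
void perform_by_invalidate(cache_entry *remote)
{
    remote->valid = false;
}

/* Update-based: the stored value is deposited in the remote cache. */
void perform_by_update(cache_entry *remote, int new_value)
{
    remote->value = new_value;
    remote->valid = true;
}

/* Read at the remote processor. On a miss the value is fetched from the
 * writer's copy (home_value) and one network round trip is counted. */
int remote_read(cache_entry *remote, int home_value, int *round_trips)
{
    if (!remote->valid) {
        remote->value = home_value;
        remote->valid = true;
        (*round_trips)++;
    }
    return remote->value;
}
```

After a store performed by invalidation, the first `remote_read` pays one round trip; after a store performed by update, the same read is satisfied locally.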


Midway’s  implementation  of  entry  consistency  uses 
an  update-based  protocol.  In  relatively  high  latency 
networks,  where  interprocessor  communication  is  on 
the  order  of  thousands  of  processor  cycles,  the  effect  of 
a  cache  miss  on  processor  performance  can  be  substan¬ 
tial.  For  example,  assuming  a  RISC  processor  with  a 
10  ns  cycle  time,  the  latency  of  resolving  a  cache  miss 
over an ATM network with a 100 μsec round-trip time
is  on  the  order  of  10,000  instructions.  Consequently,  it 
is  critical  to  use  an  update-based  protocol  to  minimize 
the  chance  that  a  processor  experiences  a  cache  miss. 
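
The figure of 10,000 follows from dividing the round-trip time by the cycle time, assuming roughly one instruction per cycle (an assumption of this sketch, not a statement from the text). The helper name `miss_cost_cycles` is hypothetical.

```c
#include <assert.h>

/* Number of processor cycles that elapse during one network round trip.
 * Both arguments are in nanoseconds. */
long miss_cost_cycles(long round_trip_ns, long cycle_ns)
{
    return round_trip_ns / cycle_ns;
}
```

With a 100 μsec round trip (100,000 ns) and a 10 ns cycle, `miss_cost_cycles(100000, 10)` gives 10000.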

An  advantage  of  entry  consistency  with  an  update- 
based  protocol  is  that  interprocessor  communication 
is  only  necessary  during  the  acquisition  of  synchro¬ 
nization  objects.  By  updating  only  at  synchronization 
points,  and  only  between  the  synchronizing  processors, 
new  values  for  data  guarded  by  a  synchronization  ob¬ 
ject  may  be  coalesced  and  delivered  to  a  processor  all 
at  once.  By  ensuring  that  updates  are  performed  with 
respect  to  a  processor  when  it  enters  a  critical  section, 
unexpected  delays  in  a  critical  section  as  a  result  of 
cache  misses  cannot  occur.  Moreover,  no  communica¬ 
tion  is  required  for  repeated  accesses  and  releases  of 
the  same  synchronization  object  on  the  same  proces¬ 
sor  —  common  patterns  in  parallel  programs  [Eggers 
89, Bennett et al. 90].

2.2  Caching  synchronization  objects 

Entry consistency facilitates strategies which permit
synchronization objects to be cached on the
processor(s) where they were most recently used. For a
synchronization object s, we define the owner as the
processor that last acquired s. Only the owner of s
may perform updates to the data guarded by s. The
processor that owns a synchronization object may enter
and exit the associated critical sections without
having to communicate updates of shared memory to
other processors. A processor becomes an owner of
s by sending a message to the current owner. The
current owner ensures that all updates to the data
guarded by s are then performed at the new owner.
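
The effect of caching a synchronization object at its owner can be sketched as follows. The structure and the function `ec_acquire` are hypothetical, and the cost of two messages per ownership transfer (a request plus a reply carrying the updates) is an assumption made only to have something to count.

```c
#include <assert.h>

/* A synchronization object s, tracked only by who owns it (illustrative). */
typedef struct {
    int  owner;      /* processor that last acquired s */
    long messages;   /* messages exchanged to move ownership */
} sync_object;

/* Acquire s on processor p. Re-acquisition by the owner is purely local;
 * otherwise ownership moves to p at an assumed cost of two messages. */
void ec_acquire(sync_object *s, int p)
{
    if (s->owner == p)
        return;              /* cached at p: no communication */
    s->messages += 2;        /* assumed: request + reply with updates */
    s->owner = p;
}
```

Repeated acquires by the same processor leave the message count unchanged; only a change of owner adds to it.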

An  unfortunate  aspect  of  single  ownership  is  that  no 
more  than  one  processor  at  a  time  can  access  a  given 
shared  location  even  if  the  location  is  only  being  read. 
To  guarantee  consistency,  a  processor  must  hold  the 
appropriate  synchronization  object.  However,  that 
synchronization  object,  if  used  in  the  classical  sense 
(such  as  a  semaphore),  only  permits  mutually  exclu¬ 
sive  access  to  the  data.  Consequently,  straightforward 
use  of  synchronization  objects  to  ensure  consistency 
can  limit  concurrency. 

We address this problem by defining two modes
of access to synchronization objects: exclusive and
non-exclusive. Synchronization objects continue to be
owned by a single processor, but may be replicated if
they are held only in non-exclusive mode. A processor
must perform an exclusive access to a synchronization
object s in order to update any data guarded by s.
By definition, that processor becomes the owner of
s. Reading data guarded by s, though, only requires
non-exclusive access to s.

An exclusive-mode access to a synchronization object
s requires that no other processor holds s in
non-exclusive mode. After an exclusive-mode access
to s has been performed, any processor’s next
non-exclusive mode access to s is performed with respect
to the owner of s. This enables a processor to perform
a sequence of non-exclusive accesses to s without
having to communicate with s’s owner each time.
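
A minimal sketch of the two access modes follows. All names are hypothetical, and the sketch assumes that an exclusive access revokes outstanding non-exclusive copies rather than waiting for them to be given up; the text does not say which of the two Midway does.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PROCS 32

typedef struct {
    int  owner;                /* processor that last held s exclusively */
    bool reader[MAX_PROCS];    /* processors holding s non-exclusively */
} rw_sync_object;

/* Non-exclusive access by processor p (0 <= p < MAX_PROCS). Returns true
 * if p had to contact the owner, false if the access was purely local. */
bool acquire_shared(rw_sync_object *s, int p)
{
    if (p == s->owner || s->reader[p])
        return false;
    s->reader[p] = true;
    return true;
}

/* Exclusive access by p: every non-exclusive copy is revoked and p becomes
 * the owner. Returns the number of copies revoked at other processors. */
int acquire_exclusive(rw_sync_object *s, int p)
{
    int revoked = 0;
    for (int q = 0; q < MAX_PROCS; q++) {
        if (s->reader[q] && q != p)
            revoked++;
        s->reader[q] = false;
    }
    s->owner = p;
    return revoked;
}
```

A processor's second and later non-exclusive accesses return false until some processor performs an exclusive access, after which the next non-exclusive access must again go to the owner.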

2.3  Programming  to  entry  consistency 

Entry  consistency  makes  several  assumptions  about 
the  behavior  of  parallel  programs  and  the  runtime  en¬ 
vironment.  First,  as  an  instance  of  a  weakly  consis¬ 
tent  protocol,  entry  consistency  requires  that  synchro¬ 
nization  accesses  be  distinguished  from  other  accesses. 
Second,  entry  consistency  requires  an  association  be¬ 
tween  shared  data  and  its  guarding  synchronization 
object.  Third,  to  enable  concurrent  read-sharing,  en¬ 
try  consistency  requires  that  exclusive  synchronization 
accesses  are  distinguished  from  non-exclusive  accesses. 
Finally,  entry  consistency  requires  that  updates  are 
performed  with  respect  to  an  acquiring  processor.  The 
last  constraint  affects  Midway’s  implementation,  while 
the  first  three  affect  its  programming  interface.  Specif¬ 
ically, 

•  All  synchronization  objects  should  be  explicitly 
declared  as  instances  of  one  of  Midway’s  synchro¬ 
nization  data  types,  which  include  locks  and  bar¬ 
riers. 

•  All  shared  data  must  be  explicitly  labeled  with 
the  keyword  shared,  which  is  understood  by  the 
compiler. 

•  All shared data must be explicitly associated with
at least one synchronization object. This association
is made by calls to the runtime system, is dynamic, and
may change during the execution of a program.

Programs  that  include  the  necessary  labeling  infor¬ 
mation,  and  precede  all  accesses  to  shared  data  with 
an  access  to  the  appropriate  synchronization  object 
will  observe  a  sequentially  consistent  shared  memory. 
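
The dynamic association between data and synchronization objects can be pictured as a table kept by the runtime. The sketch below is not Midway's interface: `registry`, `bind_data` and `guarded_by` are hypothetical names, and synchronization objects are identified by a plain integer for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BINDINGS 16

/* One association: the bytes [base, base + len) are guarded by lock_id. */
typedef struct {
    const void *base;
    size_t      len;
    int         lock_id;
} binding;

typedef struct {
    binding table[MAX_BINDINGS];
    int     count;
} registry;

/* Record an association at run time. Returns false if the table is full. */
bool bind_data(registry *r, const void *base, size_t len, int lock_id)
{
    if (r->count == MAX_BINDINGS)
        return false;
    r->table[r->count++] = (binding){ base, len, lock_id };
    return true;
}

/* True if addr lies in a range currently associated with lock_id. */
bool guarded_by(const registry *r, const void *addr, int lock_id)
{
    uintptr_t a = (uintptr_t)addr;
    for (int i = 0; i < r->count; i++) {
        uintptr_t lo = (uintptr_t)r->table[i].base;
        if (r->table[i].lock_id == lock_id &&
            a >= lo && a - lo < r->table[i].len)
            return true;
    }
    return false;
}
```

A checker built on such a table could flag an access to shared data made without first acquiring a synchronization object that guards it.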


3  Other  models  in  Midway 

A  parallel  program’s  consistency  requirements  can 
be  buried  within  its  algorithms  and  sharing  pat¬ 
terns.  Midway’s  implementation  of  entry  consistency 
requires  that  they  be  made  explicit.  This  may  be  dif¬ 
ficult  because  it  can  require  a  complete  understanding 
of  a  program  or  algorithm,  and  can  be  a  major  barrier 
to  porting  someone  else’s  code. 

From  a  performance  standpoint,  a  complete  trans¬ 
formation  into  entry  consistency  may  not  be  necessary. 
In  many  parallel  programs,  most  communication  is  of 
a  few  primary  data  structures.  While  a  large  num¬ 
ber  of  secondary  data  structures  may  be  used,  they 
are  shared,  or  at  least  modified,  with  low  frequency. 
There  would  be  only  a  marginal  performance  impact 
when  using  a  stronger  consistency  model  to  manage 
these infrequently modified items. For example, some
programs maintain a set of flags which change
infrequently, such as when new data is available, or when
an algorithm has terminated. This kind of data may
be most easily managed as processor consistent. There
also exist programs that can tolerate minor
inconsistencies in their results, and underspecify their
synchronization. This is done, for example, by locus, mp3d and
pthor from the Splash application suite [Singh et al.
92]. These programs can be converted to entry
consistency by, for example, binding all data to a barrier,
but this would oversynchronize the processors.
Instead, managing the data with its initially assumed
consistency model may be the best solution.

Because  entry  consistency  may  be  hard  to  use  and 
may  not  always  offer  a  performance  advantage,  a  Mid¬ 
way  program  may  also  contain  data  which  is  release 
or  processor  consistent.  Entry  consistent  data  is  asso¬ 
ciated  with  a  synchronization  object.  Data  not  associ¬ 
ated  with  a  synchronization  object,  but  with  a  “flush 
interval”  is  maintained  according  to  processor  consis¬ 
tency.  The  flush  interval  controls  the  rate  at  which 
updates  are  propagated  (in  issue  order)  to  other  pro¬ 
cessors.  An  item  that  is  neither  associated  with  a 
synchronization  object  nor  a  flush  interval  is  assumed 
to  be  release  consistent.  A  processor’s  updates  to  re¬ 
lease  consistent  data  are  performed  at  remote  proces¬ 
sors  only  when  a  release  from  the  updating  processor 
is  necessary  to  satisfy  another  processor’s  acquire. 
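
The rule for deciding which model governs a data item can be written as a three-way choice. The enum, the struct and the function `model_for` are hypothetical names used only to restate the rule above.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    ENTRY_CONSISTENT,
    PROCESSOR_CONSISTENT,
    RELEASE_CONSISTENT
} model;

/* What the program has declared about one shared item (illustrative). */
typedef struct {
    bool has_sync_object;      /* associated with a synchronization object */
    bool has_flush_interval;   /* associated with a flush interval */
} item_annotations;

/* Select the consistency model for an item from its annotations. */
model model_for(item_annotations a)
{
    if (a.has_sync_object)
        return ENTRY_CONSISTENT;
    if (a.has_flush_interval)
        return PROCESSOR_CONSISTENT;
    return RELEASE_CONSISTENT;   /* neither annotation present */
}
```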

4  Implementation 

The implementation of Midway consists of three
main components: a set of keywords and function calls
used to annotate a parallel program, a compiler which
generates code to maintain reference information for
shared data, and a runtime system to implement
several consistency models.


4.1  Compiler  and  language  support  for 
Midway 


Midway’s concurrency primitives are based on the
Mach C-Threads interface [Cooper & Draves 88]. A
Midway program is written in C, and looks like many
other parallel C programs that use thread management
directives such as fork and join, and synchronization
primitives such as lock and unlock. Shared
data can be allocated either dynamically or statically,
but must be tagged as shared during storage allocation.
References to shared data, however, do not
need a shared qualifier, so procedures can take
pointers to data which is either shared or unshared.

Midway  requires  a  small  amount  of  compile¬ 
time  support  to  implement  its  consistency  protocols. 
Whenever  the  compiler  generates  code  to  store  a  new 
value  into  a  shared  data  item,  it  also  generates  code 
that  marks  the  item  as  “dirty”  in  an  auxiliary  data 
structure.  Other  information  necessary  to  implement 
entry  consistency,  such  as  the  association  between  syn¬ 
chronization  objects  and  guarded  data,  is  specified  at 
runtime  with  procedure  calls  into  Midway’s  runtime 
system. 
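
The compile-time support can be pictured as the compiler pairing every store to shared data with a mark in a side table. The layout below (one dirty flag per item) and the names `shared_region`, `shared_store` and `collect_dirty` are assumptions for illustration; the text does not describe the actual auxiliary structure.

```c
#include <assert.h>
#include <stdbool.h>

#define N_ITEMS 8

typedef struct {
    int  data[N_ITEMS];    /* the shared items themselves */
    bool dirty[N_ITEMS];   /* auxiliary structure: one mark per item */
} shared_region;

/* What a compiler might emit in place of a plain `r->data[i] = v;`. */
void shared_store(shared_region *r, int i, int v)
{
    r->data[i]  = v;
    r->dirty[i] = true;
}

/* Gather the indices of modified items and clear their marks, as a
 * runtime might do when it needs to send updates. Returns the count. */
int collect_dirty(shared_region *r, int out[N_ITEMS])
{
    int n = 0;
    for (int i = 0; i < N_ITEMS; i++) {
        if (r->dirty[i]) {
            out[n++] = i;
            r->dirty[i] = false;
        }
    }
    return n;
}
```

Repeated stores to the same item leave a single mark, so only one update per item needs to be sent.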

An  alternative  to  relying  on  the  compiler  to  gen¬ 
erate  code  which  marks  items  as  dirty  is  to  use  the 
virtual  memory  system  to  trap  writes  to  shared  data. 
This  is  the  approach  taken  with  page-based  systems 
such  as  Ivy  and  Munin.  Although  this  approach  al¬ 
lows  programs  to  run  with  an  unmodified  compiler,  it 
has  several  drawbacks  that  can  limit  its  performance. 
First,  virtual  memory  systems,  and  their  underlying 
MMU  hardware,  do  not  have  particularly  fast  fault 
handling times [Appel & Li 91], and those times are
getting  relatively  slower,  not  faster  [Anderson  et  al. 
91].  Second,  page-based  strategies  can  incur  a  large 
number  of  write-faults  in  the  presence  of  false  shar¬ 
ing.  This  happens  when  unrelated  data  items  on  the 
same  page  are  written  by  different  processors.  Third, 
faults  which  occur  during  a  critical  section  increase  the 
amount  of  time  to  execute  the  critical  section,  thereby 
increasing contention. Similarly, faults which occur
during a barrier sequence result in processors finishing
at staggered times, even though the computation
may statically appear load-balanced.


4.2  Synchronization  management 

Distributed  synchronization  management  enables 
processors  to  acquire  synchronization  objects  not 
presently  held  in  their  local  memories.  Two  types  of 
synchronization  objects  are  supported:  locks  and  bar¬ 
riers.  Locks  are  acquired  in  either  exclusive  or  non¬ 
exclusive  mode  by  locating  the  lock’s  owner  using  a 
distributed queueing algorithm [Forin et al. 89].

Barriers  permit  SIMD-style  processing  by  synchro¬ 
nizing  multiple  processors  across  sequential  phases  of 
a  computation.  A  processor  delays  at  a  barrier  until 
all  other  processors  reach  that  same  barrier.  Shared 
data  accessed  within  a  barrier  must  be  made  consis¬ 
tent  only  at  the  point  where  the  barrier  computation 
proceeds  from  one  phase  to  the  next.  Within  a  phase 
there  are  no  consistency  guarantees  for  data  updated 
during  that  phase  (unless  other  synchronization  prim¬ 
itives  are  used). 

Midway  associates  a  manager  processor  with  each 
barrier  synchronization  object.  Processors  “cross”  the 
barrier  by  sending  a  message  to  the  manager  and  wait¬ 
ing  for  a  reply.  The  crossing  message  contains  the  bar¬ 
rier  name  and  all  updates  to  shared  data  associated 
with  the  barrier  that  were  performed  by  the  crossing 
processor.  The  manager  coalesces  the  updated  values 
it  receives  from  all  processors,  then  releases  the  pro¬ 
cessors  by  sending  the  coalesced  updates  back  to  each 
processor. Midway also supports a terminating barrier
that can be used to coalesce the final results of a
program at a single processor. Upon crossing a terminating
barrier, the data is coalesced at the manager,
but  is  not  flushed  back  to  the  participating  processors. 
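
The barrier protocol above can be sketched as a toy manager. This is a minimal model under stated assumptions, not Midway's implementation: the class name `BarrierManager`, the dictionary representation of updates, and the assumption that processors update disjoint items within a phase are all hypothetical.

```python
class BarrierManager:
    """Toy model of a barrier manager that coalesces per-processor updates."""

    def __init__(self, name, n_procs, terminating=False):
        self.name = name
        self.n_procs = n_procs
        self.terminating = terminating
        self.arrived = {}    # processor id -> updates carried by its crossing message
        self.coalesced = {}  # merged view kept at the manager

    def cross(self, proc, updates):
        """Record a crossing message; reply once every processor has arrived.

        Returns None while the barrier is still waiting, otherwise a dict
        mapping each processor to the updates sent back to it.
        """
        self.arrived[proc] = dict(updates)
        if len(self.arrived) < self.n_procs:
            return None
        for upd in self.arrived.values():
            self.coalesced.update(upd)
        # A terminating barrier keeps the data at the manager only.
        reply = {} if self.terminating else dict(self.coalesced)
        replies = {p: dict(reply) for p in self.arrived}
        self.arrived = {}
        return replies


barrier = BarrierManager("phase", n_procs=2)
first = barrier.cross(0, {"x": 1})    # still waiting for processor 1
second = barrier.cross(1, {"y": 2})   # everyone has arrived; updates go back
```

With `terminating=True` the same call sequence leaves the merged data in `coalesced` at the manager and sends empty replies to the participants.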

4.3  Cache  management 

Distributed cache management ensures that a processor
never enters a critical section without having received
all updates to the shared data guarded by that
synchronization  object.  While  this  condition  could 
be  satisfied  by  transferring  all  shared  data  guarded 
by  the  synchronization  object,  entry  consistency  re¬ 
quires  only  that  updated  data  more  recent  than  that 
contained in an acquiring processor’s cache be transferred.
To determine which updates are more recent
than others, Midway uses Lamport’s happens-before
relationship  [Lamport  78]  to  impose  a  partial  ordering 
on  updates  to  shared  data  with  respect  to  synchro¬ 
nization  accesses. 

Each processor $p_i$ maintains a monotonically increasing
counter $c_i$ which serves as its local clock.
Whenever $p_i$ sends a message, for example to synchronize,
to $p_j$, it increments $c_i$ and includes $c_i$ in the
message. Upon receipt of the message, $p_j$ sets its clock
$c_j$ to $\max(c_j, c_i)$. Each synchronization object s has an
associated timestamp $t_s$ which is set to the value of $c_j$
whenever its ownership transfers to another processor
$p_j$. Each shared data value v guarded by s has an
associated timestamp $t_v$ that is logically set to the local
clock value whenever v is updated. When processor $p_i$
requests s from $p_j$, the request contains $p_i$’s last value
of $t_s$, $t_{s,i}$, which is the “time” that $p_i$ last observed s.
For each shared value v guarded by s, if $t_v > t_{s,i}$, then
$p_i$’s cache has a stale version of v and $p_j$ must transfer
the new value of v with s.
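
The clock rules and the staleness test can be written out as a short sketch. The names `Processor` and `stale_values` and the dictionary of per-value timestamps are hypothetical; only the rules themselves (increment on send, take the maximum on receipt, transfer a value when its timestamp exceeds the requester's last observed lock timestamp) come from the text.

```python
class Processor:
    """Toy processor with a Lamport-style local clock."""

    def __init__(self, name):
        self.name = name
        self.clock = 0  # the local clock c_i

    def send(self):
        """Increment the local clock and return the value carried by the message."""
        self.clock += 1
        return self.clock

    def receive(self, msg_clock):
        """On receipt, set the local clock to the maximum of the two clocks."""
        self.clock = max(self.clock, msg_clock)


def stale_values(data_timestamps, last_seen):
    """Names of guarded values whose timestamp exceeds the requester's last observed one."""
    return sorted(v for v, t_v in data_timestamps.items() if t_v > last_seen)


p_i, p_j = Processor("i"), Processor("j")
p_j.clock = 7
p_i.receive(p_j.send())  # p_j's clock goes from 7 to 8; p_i's clock becomes 8
needed = stale_values({"a": 3, "b": 9}, last_seen=5)  # only "b" is newer than 5
```

Only values modified after the requester last observed the lock are shipped; a requester that reports a last-seen time of 0 would receive every guarded value with a positive timestamp.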

Timestamps  are  arranged  in  memory  so  that  the 
runtime  system  can  quickly  convert  from  a  shared 
item’s  address  to  its  timestamp.  Midway  avoids  com¬ 
puting  a  timestamp  for  each  update,  delaying  until 
the  timestamp  is  needed  by  the  synchronization  pro¬ 
tocol.  On  store,  the  local  timestamp  field  is  set  to 
zero  to  indicate  that  the  associated  data  item  has 
been  modified.  When  a  synchronization  object  s  is 
requested from a processor $p_j$, all data guarded by s
whose timestamp is zero will have their timestamps
set to $c_j$. When a shared data item is allocated, the
granularity  of  the  timestamp  (in  effect,  the  cache  line 
size)  can  be  selected  by  the  programmer  according  to 
the  expected  access  patterns  to  the  item.  For  exam¬ 
ple,  a  large  contiguous  object  may  be  backed  by  many 
timestamps  to  improve  the  granularity  of  sharing  and 
update  information.  Timestamp  granularity  can  be  as 
fine  as  a  single  byte. 
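
The lazy timestamping scheme and the address-to-timestamp mapping can be modelled as follows. This is a sketch under assumptions: the `GuardedRegion` class, the contiguous layout with one timestamp per `line_size` bytes, and the initial timestamp value of 1 are hypothetical and are not taken from Midway's runtime.

```python
class GuardedRegion:
    """Toy model of a shared region backed by one timestamp per cache line."""

    def __init__(self, base, size, line_size):
        self.base = base
        self.line_size = line_size  # programmer-chosen granularity, in bytes
        n_lines = (size + line_size - 1) // line_size
        self.timestamps = [1] * n_lines  # assumed initial value; nonzero means unmodified

    def line_of(self, addr):
        """Map a shared address to the index of its timestamp."""
        return (addr - self.base) // self.line_size

    def store(self, addr):
        """A store only zeroes the timestamp; no clock value is computed here."""
        self.timestamps[self.line_of(addr)] = 0

    def stamp_modified(self, clock):
        """On a synchronization request, give every zeroed timestamp the clock value."""
        stamped = [i for i, t in enumerate(self.timestamps) if t == 0]
        for i in stamped:
            self.timestamps[i] = clock
        return stamped


region = GuardedRegion(base=0x1000, size=64, line_size=16)
region.store(0x1004)  # falls in line 0
region.store(0x1025)  # falls in line 2
lines = region.stamp_modified(clock=12)
```

A smaller `line_size` gives finer-grained update information at the cost of more timestamps for the same region.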

Midway  does  not  assume  that  processors  have  in¬ 
finite  caches.  At  any  point,  a  processor  may  discard 
a  shared  data  item  as  long  as  that  processor  does  not 
presently  own  the  synchronization  object  guarding  the 
item. When next acquiring the guarding synchronization
object s, the discarding processor indicates that
it has not held s for a “very long time.”

4.4 Supporting stronger consistency models

Much  of  Midway’s  infrastructure  for  entry  consis¬ 
tency  can  be  leveraged  to  support  processor  consis¬ 
tency  and  release  consistency.  Supporting  these  other 
models  requires  that  the  compiler  and  runtime  detect 
writes  to  shared  data,  perform  updates  at  other  pro¬ 
cessors  in  the  order  required  by  the  model,  and  recover 
from  cache  misses  which  occur  when  a  processor  ac¬ 
cesses  a  shared  data  item  not  present  in  its  local  cache. 
For  this,  Midway  overloads  the  timestamp  mechanism 
described  earlier.  The  compiler  emits  the  same  code 
for  a  store  to  a  shared  address,  but,  at  runtime,  the 
computed  timestamp  is  treated  differently. 


Data  items  maintained  according  to  processor  or 
release,  but  not  entry,  consistency,  initially  have  the 
high  bit  in  their  associated  timestamp  set.  On  a  store 
to  the  line,  if  the  high  bit  of  the  timestamp  field  is 
set,  then  the  store  will  be  performed  at  all  other  pro¬ 
cessors  independent  of  any  particular  synchronization 
operation.  The  modified  address  is  recorded  in  a  per- 
processor  queue  of  pending  updates  and  the  high  bit 
of  the  timestamp  is  cleared  to  ensure  the  item  is  not 
queued  again.  This  strategy  for  queue  updates  allows 
us  to  use  the  same  compiler-emitted  code  sequence 
when  updating  both  entry  consistent  and  non-entry 
consistent  data.  Once  the  non-entry  consistent  data 
has  been  queued,  subsequent  stores  continue  to  clear 
the  timestamp  field  just  as  if  the  data  were  entry  con¬ 
sistent. 
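
The single store-time code path can be sketched as below. The 32-bit timestamp width, the list-based timestamp table, and the function name `store` are assumptions for illustration; the text specifies only that a set high bit causes the line to be queued once and that stores otherwise clear the timestamp field.

```python
HIGH_BIT = 1 << 31  # assumed 32-bit timestamp field


def store(timestamps, pending, line):
    """Model of the one store-time code path shared by all consistency models.

    If the line's timestamp has the high bit set, the line is not entry
    consistent and has not been queued yet, so its index is appended to the
    per-processor `pending` queue. In every case the field ends up zero.
    """
    if timestamps[line] & HIGH_BIT:
        pending.append(line)
    timestamps[line] = 0  # clears the high bit and marks the line as modified


timestamps = [HIGH_BIT | 5, 9]  # line 0: release or processor consistent; line 1: entry consistent
pending = []
store(timestamps, pending, 0)
store(timestamps, pending, 0)   # the second store does not queue the line again
store(timestamps, pending, 1)   # entry consistent data is never queued
```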

Associated with each cache line is a copyset that
defines  the  processors  holding  a  copy  of  the  line  in 
their  local  memories.  A  pending  update  to  a  line  need 
only  be  performed  at  the  processors  in  that  line’s  copy- 
set.  For  release  consistent  data,  pending  updates  are 
flushed  by  a  processor  any  time  the  processor’s  re¬ 
leasing  synchronization  operation  is  performed  at  an¬ 
other  acquiring  processor.  Processor  consistent  data 
is flushed at these points, as well as at the periodic
interval specified by the programmer. In either case,
when an update is flushed by a processor, the high bit
in the item’s timestamp is set to catch that processor’s
next store to the item.
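
Flushing pending updates to a copyset can be sketched in the same style. The function name `flush`, the (destination, line) message tuples, and the 32-bit timestamp width are hypothetical; the sketch shows only that updates reach the other copyset members and that the high bit is set again afterwards.

```python
HIGH_BIT = 1 << 31  # assumed 32-bit timestamp field


def flush(pending, copysets, timestamps, me):
    """Drain the pending queue, returning (destination, line) update messages.

    Updates go only to the other members of each line's copyset. The high
    bit is set again afterwards so the next local store re-queues the line.
    """
    messages = []
    for line in pending:
        for proc in sorted(copysets[line] - {me}):
            messages.append((proc, line))
        timestamps[line] |= HIGH_BIT
    pending.clear()
    return messages


copysets = {0: {0, 1, 2}, 1: {0}}  # line 1 is held only by the flushing processor
timestamps = [0, 0]
pending = [0, 1]
msgs = flush(pending, copysets, timestamps, me=0)
```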

Under  entry  consistency,  all  data  bound  to  a  syn¬ 
chronization  object  is  prefetched  when  the  object  is 
acquired.  With  processor  consistency  and  release  con¬ 
sistency,  there  is  no  associated  synchronization  object 
and  no  way  to  know  which  data  to  prefetch  or  when  to 
prefetch  it.  Consequently,  on  a  processor’s  first  refer¬ 
ence  to  an  item,  that  processor  will  not  be  in  the  item’s 
copyset  and  will  not  have  received  any  prior  updates. 
We  detect  this  condition  with  a  copyset  fault,  which  is 
implemented  with  virtual  memory  page  faults. 

Each  processor  marks  virtual  memory  pages  that 
contain  cache  lines  for  which  the  processor  is  not  in 
the  copyset  as  no-access.  Access  to  such  a  page  causes 
a  page  fault,  and  the  faulting  processor  fetches  the 
faulted  page  from  the  page's  home  processor.  Every 
page  has  a  home,  based  on  the  page’s  virtual  address. 
Before  the  home  node  returns  the  page’s  data,  the 
faulting node is added to the page’s copyset and all
other  members  are  informed  of  the  change.  Any  pro¬ 
cessor  but  the  home  processor  may  remove  itself  from 
the  copyset  of  a  page  by  notifying  the  page's  home 
processor. 
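
The copyset fault path can be modelled roughly as follows. The round-robin `home_of` mapping, the 4096-byte page size, and the `HomeNode` class are assumptions; the text states only that the home is derived from the page's virtual address, that the faulting node joins the copyset, and that the home never leaves it.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def home_of(addr, n_procs, page_size=PAGE_SIZE):
    """Hypothetical address-based home assignment: pages are dealt round-robin."""
    return (addr // page_size) % n_procs


class HomeNode:
    """Tracks the copyset of each page homed at this processor."""

    def __init__(self, me):
        self.me = me
        self.copysets = {}  # page number -> set of processors holding the page

    def handle_fault(self, page, faulting_proc):
        """Add the faulting processor; return the remote members to inform."""
        members = self.copysets.setdefault(page, {self.me})
        to_notify = sorted(members - {faulting_proc, self.me})
        members.add(faulting_proc)
        return to_notify

    def leave(self, page, proc):
        """Any processor except the home may drop out of a page's copyset."""
        if proc == self.me:
            raise ValueError("the home processor cannot leave the copyset")
        self.copysets[page].discard(proc)


home = HomeNode(me=home_of(0x3000, n_procs=4))
told_first = home.handle_fault(page=3, faulting_proc=1)   # no other remote members yet
told_second = home.handle_fault(page=3, faulting_proc=2)  # processor 1 must be informed
```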

With this strategy, all cache lines within a virtual
page are part of the same copyset. A page fault occurs
only when a processor adds itself to a cache line’s
copyset; otherwise runtime write detection to shared
data is done with compiler-emitted code, rather than
with faults. Thus multiple processors may write to a
page at the same time.

For all three consistency models, Midway uses update,
rather than invalidate, to perform writes to other
processors. This is done for two reasons. First, it
guarantees that processors never experience a cache miss
except on a copyset fault. Second, it enables the use of
the same update machinery used for entry consistency.

5 Performance

In this section we look at the behavior of a simple
parallel program under Midway using both entry
consistency and release consistency. For our
measurements, we use two metrics which we present as a
function of the number of processors on which an
application runs: execution time and message count. Our
measurements show that a program written to entry
consistency requires substantially fewer messages than
one written to the stronger models. This translates into
improved execution time.

5.1 The hardware platforms

Our implementation of Midway runs on MIPS
R3000-based DECstation 5000/120s and 5000/200s
on top of two networks: a 10 Mb/sec ethernet and
a 155 Mb/sec ATM network. The operating system
is Mach 3.0 with CMU’s Unix server [Golub
et al. 90]. The DECstation 5000/120 uses a 20 MHz
R3000 with a 12.5 MHz Turbochannel (TC) bus
interface. The 5000/200 uses a 25 MHz R3000 with a
25 MHz Turbochannel interface. Presently, we have
only four 155 Mb/sec ATM network interfaces, and we
use these to connect the faster and more network-capable
5000/200s through the central switch.

We present results on three configurations: slow-ether,
which uses the slower DS5000/120s connected by ethernet;
fast-ether, which uses the faster DS5000/200s connected
by ethernet; and fast-ATM, which is like fast-ether,
except that we use the ATM network instead. The
slow-ether configuration allows us to evaluate
performance on more than four processors, while the
fast-ATM network lets us look at the system’s behavior
on a faster network with faster processors. We include
the fast-ether numbers to provide a common point of
comparison between slow-ether and fast-ATM.

5.2 Matrix multiply

We  present  the  results  of  a  simple  matrix  multiply 
application  running  on  several  processors  to  provide 
an integrity check for Midway, to show the interaction
between processor speed and network speed, and
to  compare  the  behavior  of  a  single  program  under  two 
different  consistency  models.  MM-ec  is  a  matrix  mul¬ 
tiply  program  which  multiplies  two  512  x  512  floating 
point  matrices  on  from  one  to  eight  processors  using 
entry  consistency.  A  master  processor  writes  the  two 
input matrices A and B, and then spawns slave processes
on the other machines. The master and each
slave machine acquire a non-exclusive (read) lock on
all of A, and a non-exclusive lock on a portion of B.
Each processor then computes its portion of the result
matrix C, and crosses a final barrier. In doing so, it
returns to the master the portion of C that it has written.
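
The partitioning of the work can be illustrated with a small sketch. The paper does not say how B is divided among processors; splitting B into contiguous column blocks, and the function names `column_blocks` and `multiply_block`, are assumptions made here for illustration. Each processor needs all of A and only its own columns of B to produce the matching columns of C.

```python
def column_blocks(n_cols, n_procs):
    """Split column indices 0..n_cols-1 into n_procs contiguous, near-equal blocks."""
    base, extra = divmod(n_cols, n_procs)
    blocks, start = [], 0
    for p in range(n_procs):
        width = base + (1 if p < extra else 0)
        blocks.append(range(start, start + width))
        start += width
    return blocks


def multiply_block(a, b, cols):
    """Compute the columns `cols` of C = A x B, as one processor would."""
    inner = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in cols]
            for i in range(len(a))]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
# Two "processors", each computing one column of the 2 x 2 result.
parts = [multiply_block(a, b, cols) for cols in column_blocks(2, 2)]
```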

The  main  advantage  of  using  entry  consistency  for 
matrix  multiply  is  that  the  initial  message  required  to 
satisfy  each  slave  processor’s  non-exclusive  lock  acqui¬ 
sition  transfers  the  input  matrices.  Similarly,  when  a 
slave  terminates,  the  message  it  sends  indicating  that 
it  has  crossed  a  barrier  also  transfers  the  results  back 
to  the  master.  In  effect,  the  consistency  model  in  com¬ 
bination  with  a  synchronization  protocol  natural  to 
the  problem  provides  optimal  clustering  of  data,  both 
in  terms  of  number  of  bytes  transferred  and  number 
of  messages  generated. 

To  assess  the  importance  of  this  clustering,  we  have 
written  a  version  of  matrix  multiply  to  use  release, 
rather  than  entry,  consistency  (MM-rc).  Table  1  shows 
the  elapsed  time,  speedup,  time  spent  computing  and 
transferring  data,  amount  of  data  transferred,  and 
the  total  number  of  messages  for  MM-ec  and  MM-rc. 
Since  fast-ATM  represents  the  best  possible  configu¬ 
ration,  we  show  the  results  of  MM-rc  only  for  it.  In 
the  single-processor  case,  we  compiled  the  program 
to  run  on  a  uniprocessor,  eliminating  synchronization 
and  timestamp  management  overhead. 

Elapsed  time  is  the  interval  beginning  after  all  pro¬ 
cessors  have  started  and  ending  when  the  master  pro¬ 
cessor  terminates.  This  includes  the  time  to  move  the 
input  and  output  data  sets  between  processors.  The 
data  transfer  times  shown  are  from  the  perspective 
of  the  master  processor  (which  sources  and  sinks  all 
data).  The  input  and  output  data  sizes  are  only  a 
function  of  the  number  of  processors  (problem  par¬ 
titioning),  and  not  the  processor,  network  or  consis¬ 
tency  model.  The  number  of  messages  is  a  function  of 
the  number  of  processors  and  the  consistency  model. 
The compute time is the time spent actually working
on the matrices and, as shown, scales with the number
of processors.

#       Elapsed  Speedup  Input     Compute  Output    Data         #
Procs.  (secs)            Transfer  Results  Transfer  Transferred  Messages
                          (secs)    (secs)   (secs)    (mbytes)

MM-ec: slow-ether
1       282      1        0         ?        0         0            0
2       148.4    1.90     ?         132.4    3.05      2.14         24
4       84.9     3.32     ?         66.3     6.51      4.81         72
6       ?        4.34     13.72     44.4     6.92      7.13         120
8       58.6     4.81     17.56     33.2     7.87      9.36         168

MM-ec: fast-ether
1       164      1        0         ?        0         0            0
2       92.8     1.77     8.70      81.4     2.66      2.14         24
4       53.3     ?        8.50      ?        4.17      4.81         72

MM-ec: fast-ATM
1       164      1        0         ?        0         0            0
2       83.5     1.96     1.53      81.7     ?         2.14         24
4       43.3     3.79     1.86      ?        ?         4.81         72

MM-rc: fast-ATM
1       164      1        0         0        0         0            0
2       86.8     1.89     3.14      82.0     1.61      2.17         1802
4       48.4     3.39     6.06      41.1     1.28      4.97         5106

(? = value not legible in this copy.)

Table 1: Breakdown of the performance of matrix multiply using both entry consistency and release consistency.

Looking  only  at  the  entry  consistent  configurations, 
speedup  is  slightly  less  than  linear  because  the  com¬ 
munication  overheads  do  not  decrease  as  processors 
are  added.  The  fact  that  speedup  at  4  processors  for 
fast-ether is worse than for slow-ether shows the impact
of increasing processor performance without increasing
network performance. A single DECstation
5000/200  is  roughly  twice  as  fast  at  matrix  multiply 
as  a  DECstation  5000/120.  Communication  over¬ 
head  on  both  systems  is  roughly  the  same,  therefore 
there  is  less  room  for  improvement  when  running  on 
a  network  of  5000/200s.  In  contrast,  speedup  is  much 
closer  to  linear  on  the  fast-ATM  network  where  com¬ 
munication  overhead  is  less  than  6%  of  total  execution 
time  (as  opposed  to  almost  25%  on  the  fast-ether  con¬ 
figuration). 
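
The speedup and overhead figures quoted above follow directly from the elapsed and transfer times in Table 1. The short calculation below reproduces a few of them; the function names are hypothetical, and communication overhead is taken here to mean input plus output transfer time as a fraction of elapsed time, which is an assumption about how the percentages were derived.

```python
def speedup(t_one, t_n):
    """Speedup of an n-processor run relative to the one-processor run."""
    return t_one / t_n


def comm_fraction(elapsed, input_xfer, output_xfer):
    """Fraction of elapsed time spent moving input and output data."""
    return (input_xfer + output_xfer) / elapsed


# Values taken from Table 1 (seconds).
slow_ether_4 = speedup(282, 84.9)    # 4 processors, slow-ether
fast_ether_2 = speedup(164, 92.8)    # 2 processors, fast-ether
fast_ether_4 = speedup(164, 53.3)    # 4 processors, fast-ether
fast_ether_4_comm = comm_fraction(53.3, 8.50, 4.17)
```

Under this definition the four-processor fast-ether run spends a little under a quarter of its elapsed time on data transfer, in line with the "almost 25%" figure in the text.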

By  point  of  comparison,  the  Munin  system  yielded 
an  8-fold  speedup  for  an  8  processor  matrix  multiply 
using  release  consistency  running  over  a  10  Mb/sec 
ethernet [Carter et al. 91]. The processors used there,
however, were substantially slower than those used in
our fast-ether configuration. A single Munin processor
could compute the product of two 400x400 integer
matrices in a little over 700 seconds, almost five
times slower than a DS5000/200 running on a larger
(512x512) and harder (floating point) input set. From
this  we  conclude  that  uniprocessor  performance  has 
reached  the  point  where  ethernet  is  no  longer  a  vi¬ 
able  network  for  parallel  processing,  even  for  well- 
structured  parallel  applications  such  as  matrix  mul¬ 
tiply. 

Moving to the release consistent runs, we can see
the impact of Midway’s implicit prefetch that comes
during  lock  acquisition  for  entry  consistency.  Instead 
of  transferring  the  entire  input  and  output  matrices  in 
single  messages,  as  is  done  with  entry  consistency,  the 
release  consistent  implementation  misses  frequently  as 
can  be  seen  by  the  number  of  messages  sent.  Most  of 
the  messages  correspond  to  the  transfer  of  a  page  of 
data.  Specifically,  each  message  for  the  entry  con¬ 
sistent  run  corresponds  to  a  synchronization  request, 
which  also  occurs  during  the  release  consistent  run. 
The  additional  messages  for  the  release  consistent  run 
are  for  data  transfer.  The  elapsed  time  difference  be¬ 
tween  the  release  and  entry  consistent  runs  is  due  to 
the overhead of having to send more messages, even
though  the  total  amount  of  data  transferred  is  nearly 
unchanged. 
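
The size of the message gap can be read off Table 1. The following calculation uses only the message counts reported there for the fast-ATM configuration; the dictionary layout and function names are hypothetical.

```python
# Message counts from Table 1 on the fast-ATM configuration.
messages = {
    "MM-ec": {2: 24, 4: 72},
    "MM-rc": {2: 1802, 4: 5106},
}


def extra_messages(n_procs):
    """Messages sent by the release consistent run beyond the entry consistent run."""
    return messages["MM-rc"][n_procs] - messages["MM-ec"][n_procs]


def message_ratio(n_procs):
    """How many times more messages MM-rc sends than MM-ec."""
    return messages["MM-rc"][n_procs] / messages["MM-ec"][n_procs]
```

On these figures the release consistent run sends roughly 70 to 75 times as many messages as the entry consistent run.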


6  Conclusions 

There exists a range of memory consistency models
that provide different kinds of behavior both in
terms  of  semantics  and  performance.  Strongly  con¬ 
sistent  memory  systems  simplify  porting  and  reason¬ 
ing  about  programs  written  for  shared  memory  mul¬ 
tiprocessors  but  are  limited  in  their  ability  to  conceal 
latency. Entry consistency takes into account both
synchronization  behavior  and  the  relationship  between 
synchronization  objects  and  data.  This  allows  the  run¬ 
time  system  to  hide  the  network  overhead  of  memory 
references  by  folding  all  memory  updates  into  synchro¬ 
nization  operations. 